feat(bb): WebGPU field-mul bench + Karatsuba/sos3uv3 Mont mults #23341
Open
zac-williamson wants to merge 4 commits into
…ults

Adds a standalone WebGPU micro-benchmark page comparing three BN254 Montgomery product implementations for chained-mul throughput:

- cios (u32): mitschabaude runtime-loop CIOS over 20×13-bit limbs. Baseline, ~109 ms at n=2^20, k=100.
- karat (u32): recursive Karatsuba + Yuval reduction. Nine 5×5 schoolbook sub-sub-products are computed independently and combined via two Karatsuba levels; the reduction uses precomputed r_inv = W^-1 mod p with zero drains in the multiply phase (the unsigned wrap unwinds via a subsequent subtraction). ~80 ms (~28% faster than cios).
- sos3uv3 (f32, reference): 22-bit f32 limbs with separate per-slot tlo/thi accumulators that break the inner-j carry chain. Single drain per outer iteration via bias_split_f32_le4w. ~79 ms.

The bench harness:

- bench-field-mul.html is a standalone page; it reads ?path=u32|f32&n=N&k=K&validate-n=N&reps=R&variant=V from the URL.
- bench-field-mul.ts runs k chained Mont mults per thread, validates the first `validate-n` outputs against a host BigInt reference, and writes timing into window.__bench.
- scripts/bench-field-mul.mjs is a Playwright driver for headless invocation from the CLI (playwright-core added as a devDependency).
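The host BigInt reference used for validation can be as simple as computing the Montgomery product directly over BigInt. A minimal sketch (names `P`, `montMulRef`, and the helpers are illustrative, not the actual bench code; it assumes the BN254 base-field modulus and the 2^260 radix implied by 20×13-bit limbs):

```typescript
// BN254 base-field modulus (assumed; the bench targets BN254 Fq).
const P = 21888242871839275222246405745257275088696311157297823662689037894645226208583n;
const R = 1n << 260n; // Montgomery radix for 20 × 13-bit limbs

// Extended Euclid, used to obtain R^-1 mod P.
function egcd(a: bigint, b: bigint): [bigint, bigint, bigint] {
  if (b === 0n) return [a, 1n, 0n];
  const [g, x, y] = egcd(b, a % b);
  return [g, y, x - (a / b) * y];
}

function modInv(a: bigint, m: bigint): bigint {
  const [g, x] = egcd(((a % m) + m) % m, m);
  if (g !== 1n) throw new Error('not invertible');
  return ((x % m) + m) % m;
}

const R_INV = modInv(R % P, P);

// Montgomery product: mont(x, y) = x * y * R^-1 mod P.
// For inputs in Montgomery form (aR, bR) this yields abR, so chained
// GPU results can be checked directly against this reference.
function montMulRef(x: bigint, y: bigint): bigint {
  return (x * y * R_INV) % P;
}
```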
Routes the `montgomery_product_funcs` mustache partial through a
pre-rendered Karatsuba+Yuval body in every MSM shader that does a
base-field multiply (15 callsites: convert_points, smvp, horner,
batch_affine_{apply,schedule,finalize_*,init,apply_scatter},
batch_inverse{,_parallel}, bpr, decompress_g1, montgomery_parity).
The Karatsuba body benches ~27% faster than the mitschabaude
runtime-loop CIOS at n=2^20, k=100 (80 ms vs 109 ms). It exposes the
same `fn montgomery_product(x, y) -> BigInt` symbol plus the same
`get_p` / `conditional_reduce` helpers and uses the same 20×13-bit
limb layout, so the swap is a drop-in change with no callsite churn.
The field-mul bench retains both options (`?variant=cios` renders the
original template inline, `?variant=karat` reuses the class-level
default) so the two bodies can be compared side-by-side.
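The shared 20×13-bit limb layout is what makes the swap drop-in. A hypothetical host-side pack/unpack pair illustrating that layout (the actual packing lives in the WGSL templates; `toLimbs`/`fromLimbs` are names invented here):

```typescript
const NUM_LIMBS = 20;
const LIMB_BITS = 13n;
const LIMB_MASK = (1n << LIMB_BITS) - 1n; // 0x1fff

// Little-endian split of a <=260-bit field element into 13-bit u32 limbs.
function toLimbs(x: bigint): number[] {
  const limbs: number[] = [];
  for (let i = 0; i < NUM_LIMBS; i++) {
    limbs.push(Number((x >> (LIMB_BITS * BigInt(i))) & LIMB_MASK));
  }
  return limbs;
}

// Reassemble the BigInt from its limbs.
function fromLimbs(limbs: number[]): bigint {
  let x = 0n;
  for (let i = limbs.length - 1; i >= 0; i--) {
    x = (x << LIMB_BITS) | BigInt(limbs[i]);
  }
  return x;
}
```

Because both Montgomery bodies consume and produce this exact layout, swapping the mustache partial changes no callsite.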
Phase 1 LANDED — Bernstein–Yang (BY) safegcd inversion (fr_inv_by_a; Option A: 20×13-bit limbs, BATCH=26, carry-free apply_matrix):
- Production swap-in: wgsl/cuzk/batch_inverse{,_parallel}.template.wgsl call fr_inv_by_a
- 1.5× faster than the legacy fr_inv (Pornin, K=12) on the chained-inverse bench
- ~8% MSM wall reduction at logN=16 sanity check
- TS port (cuzk/bernstein_yang.ts, bernstein_yang_a.ts) + Jest tests (24 passing)
- WGSL impls: wgsl/field/by_inverse{,_a}.template.wgsl + wgsl/bigint/bigint_by.template.wgsl
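The core iteration behind safegcd inversion is the Bernstein–Yang divstep. A minimal host-side sketch, shown here computing only a gcd (the TS/WGSL ports listed above additionally track transition matrices to recover the inverse; `divstep`/`divstepGcd` are illustrative names):

```typescript
// One Bernstein–Yang divstep. Precondition: f is odd.
function divstep(delta: bigint, f: bigint, g: bigint): [bigint, bigint, bigint] {
  if (delta > 0n && (g & 1n) === 1n) {
    // Swap roles and subtract: g - f is even, so the halving is exact.
    return [1n - delta, g, (g - f) >> 1n];
  }
  // If g is even the added term vanishes; if g is odd, g + f is even.
  return [1n + delta, f, (g + (g & 1n) * f) >> 1n];
}

// Iterate divsteps until g reaches 0; |f| is then gcd(f0, g0).
function divstepGcd(f: bigint, g: bigint): bigint {
  let delta = 1n;
  while (g !== 0n) {
    [delta, f, g] = divstep(delta, f, g);
  }
  return f < 0n ? -f : f;
}
```

Production code runs a fixed number of divsteps in batches (BATCH=26 above), accumulating each batch into a small matrix applied to the limb vectors.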
Phase 2 EXPLORATORY — multi-window pooled batch_inverse + multi-window BPR:
- WPB plumbing in batch_inverse_parallel + dispatch_args + batch_affine.ts
- Default WPB=1 (= legacy behavior, no perf change)
- BPR_WINDOWS_PER_BATCH knob in bpr_bn254.template.wgsl
- Empirical: pooling without growing WG count gives 0% gain — design needs restructure
Standalone bench infrastructure:
- bench-divsteps, bench-apply-matrix, bench-fr-inv, bench-batch-affine
- Each with HTML page + TS dispatcher + Playwright runner under dev/msm-webgpu/scripts/
- profile-sanity.mjs for per-pass GPU time breakdown on the Quick Sanity Check
Tree-reduce design (Stage B) for autonomous remote execution:
- .claude/plans/msm-tree-reduce.md — full design (adaptive batch sizing, analytical slice partition, 2 distinct phase kernels)
- .claude/plans/remote-agent-brief.md — remote agent execution brief
Co-authored with Claude.
Summary
Adds a standalone WebGPU micro-benchmark page (bench-field-mul.html + headless Playwright driver) that compares three BN254 Montgomery product implementations for chained-mul throughput.

karat (u32, the main win)

Recursive Karatsuba (20×20 → 10×10 → 5×5) over unsigned 13-bit limbs, with Yuval reduction using precomputed r_inv = W⁻¹ mod p. Nine 5×5 schoolbook sub-sub-products are computed independently and combined via two Karatsuba levels. Zero drains in the multiply phase: a single pp_cr_C slot overflows u32 by ~1.25×, and the wrap unwinds correctly through the subsequent unsigned subtraction (the algebraic identity P_mid[m] = Σ(x_lo·y_hi + x_hi·y_lo) is non-negative per limb at lazy values). Fully unrolled via mustache so all indices are compile-time constants — naga SROAs the temp slots into registers instead of thread-private memory.

sos3uv3 (f32, kept as reference)

22-bit f32 limbs with separate per-slot tlo[k]/thi[k] accumulators that break the inner-j carry chain. Each j writes unique tlo[j-1] and thi[j], so there is no overlap or RAW dependency across iterations. Single drain at the end of each outer iteration via bias_split_f32_le4w. The 22-bit width buys an exact 4-way sum (4·W = 2²⁴ fits in the f32 mantissa).

Test plan

1. yarn install to pick up the playwright-core devDependency.
2. yarn generate:wgsl && yarn build:esm.
3. cd barretenberg/ts && ./node_modules/.bin/vite --config dev/msm-webgpu/vite.config.ts --no-open.
4. node barretenberg/ts/dev/msm-webgpu/scripts/bench-field-mul.mjs --path u32 --n 1048576 --k 100 --variant karat --validate-n 1024 --reps 6 (and --variant cios, and --path f32 --variant sos3uv3). Each should print VALIDATION OK and a timing reps=… median=… line.
5. Alternatively, open http://localhost:5173/dev/msm-webgpu/bench-field-mul.html?path=u32&variant=karat&n=1048576&k=100&validate-n=1024&reps=6 and read window.__bench.
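The exact 4-way-sum property sos3uv3 relies on can be checked on the host: four values each below 2²² sum to less than 2²⁴, so every partial sum fits in the 24-bit f32 significand. A sketch (assumed setup, not bench code) using Math.fround to round each step to f32 precision, emulating GPU f32 accumulation:

```typescript
// Chain four additions, rounding each partial sum to f32.
// All partials stay below 2^24, so each rounding is a no-op.
function f32Sum4(a: number, b: number, c: number, d: number): number {
  return Math.fround(Math.fround(Math.fround(a + b) + c) + d);
}
```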